Using the Generic Document Profile to Cluster Similar Texts

Author

  • Jeremy Ellman
Abstract

The World Wide Web contains a huge quantity of text that is notoriously inefficient to use. This work aims to apply a text processing technique based on thesaurally derived lexical chains to improve Internet Information Retrieval, where a lexical chain is a set of words in a text that are related both by proximity and by relations derived from an external lexical knowledge source such as WordNet, Roget's Thesaurus, LDOCE, and so on. Finding information on the Internet is notoriously hard, even when users have a clear focus to their queries. This situation is exacerbated when users have only vague notions about the topics they wish to explore. This could be remedied using Exemplar Texts, where an Exemplar Text is the ideal model result for Web searches. Our problem is now transformed into one of identifying similar texts. The Generic Document Profile is designed to allow the comparison of document similarity whilst being independent of terminology and document length. It is simply a set of semantic categories derived from Roget's thesaurus with associated weights. These weights are based on lexical chain length and strength. A Generic Document Profile can be compared to another using a Case Based Reasoning approach. Case Based Reasoning (CBR) is a problem solving method that seeks to solve existing problems by reference to previous successful solutions. Here our Exemplar Texts count as previous solutions (and in these experiments, MS Encarta is used as the source of Exemplar Texts). In CBR a query and examples are usually represented as attribute-value pairs. Thus to apply CBR to document comparison, both the text acting as a query and the documents to be compared against need to be represented in equivalent terms. If this representation is based on simple terms (i.e. word stems), the problem becomes hugely complex, since there are many words in a language. This would also be a fragile approach, since semantically equivalent words would not count as equal.
However, if we represent a document as Roget categories, the problem becomes tractable. We use Roget's thesaurus since there are 1024 main thesaurus categories as opposed to 50000+ synsets in WordNet. This approach is providing interesting results, although it is subject to the well-known Word Sense Disambiguation problem. The paper will summarise current findings, cover some implementation details, and describe the approach taken to full-text Word Sense Disambiguation. Hesperus is a system designed to address this problem. Using an electronic encyclopedia as a source of subject-defining texts, queries are made to MetaCrawler. This searches several search engines in parallel, returning the best 10-20 links. Hesperus retrieves these web pages and computes their conceptual similarity to the Exemplar Text using a method based on thesaurally determined lexical chains. Initial results show that users prefer the Hesperus page order to MetaCrawler's statistical ordering. The technique consequently shows promise as a way to improve the effectiveness of Web searching.
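The idea of profiling documents as weighted thesaurus categories and ranking pages against an Exemplar Text can be illustrated with a minimal Python sketch. The abstract does not give the chain-building algorithm, the weighting formula, or the similarity measure, so everything here is an assumption made for illustration: the toy word-to-category table (standing in for Roget's thesaurus), the proximity window, the rule that longer chains add more weight, the use of cosine similarity, and the function names `build_profile`, `similarity`, and `rank_pages`. Word Sense Disambiguation is ignored entirely.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical toy thesaurus: word -> category label. A real system would
# look words up in Roget's thesaurus; these entries are invented.
TOY_THESAURUS = {
    "car": "VEHICLE", "truck": "VEHICLE", "engine": "VEHICLE",
    "road": "TRAVEL", "journey": "TRAVEL", "trip": "TRAVEL",
    "bank": "MONEY", "loan": "MONEY", "cash": "MONEY",
}

def build_profile(text, thesaurus=TOY_THESAURUS, window=5):
    """Return {category: weight}, normalised so that the weights sum to 1."""
    words = text.lower().split()
    last_seen = {}                 # category -> position of last chain member
    chain_len = defaultdict(int)   # category -> length of the current chain
    weights = defaultdict(float)
    for pos, word in enumerate(words):
        cat = thesaurus.get(word.strip(".,;:!?"))
        if cat is None:
            continue
        # A word extends a chain if a same-category word occurred nearby;
        # otherwise it starts a new chain for that category.
        if cat in last_seen and pos - last_seen[cat] <= window:
            chain_len[cat] += 1
        else:
            chain_len[cat] = 1
        # Assumed weighting: each member adds the current chain length,
        # so longer chains contribute more than isolated words.
        weights[cat] += chain_len[cat]
        last_seen[cat] = pos
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()} if total else {}

def similarity(p, q):
    """Cosine similarity between two profiles (an assumed measure)."""
    dot = sum(w * q.get(c, 0.0) for c, w in p.items())
    norm = sqrt(sum(w * w for w in p.values())) * sqrt(sum(w * w for w in q.values()))
    return dot / norm if norm else 0.0

def rank_pages(exemplar_text, pages):
    """Order page ids by profile similarity to the exemplar, best first."""
    exemplar = build_profile(exemplar_text)
    scored = [(similarity(exemplar, build_profile(text)), page_id)
              for page_id, text in pages.items()]
    return [page_id for _, page_id in sorted(scored, key=lambda s: -s[0])]

if __name__ == "__main__":
    pages = {
        "page_a": "cash loan bank",
        "page_b": "a truck and a car on a journey",
    }
    print(rank_pages("The car engine powers the truck on the road", pages))
```

Because profiles are compared by category, not by word, a page that mentions "truck" can match an exemplar that mentions "car", and normalising the weights keeps the comparison independent of document length, which are the two properties the abstract claims for the Generic Document Profile.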

Similar resources

Plagiarism checker for Persian (PCP) texts using hash-based tree representative fingerprinting

With due respect to the authors’ rights, plagiarism detection is one of the critical problems in the field of text mining that many researchers are interested in. This issue is considered a serious one in high academic institutions. There exist language-free tools which do not yield any reliable results since the special features of every language are ignored in them. Considering the paucit...


Data Clustring Using A New CGA(Chaotic-Generic Algorithm) Approach

Clustering is the process of dividing a set of input data into a number of subgroups. The members of each subgroup are similar to each other but different from members of other subgroups. The genetic algorithm has enjoyed many applications in clustering data. One of these applications is the clustering of images. The problem with the earlier methods used in clustering images was in selecting in...


SIMBA: An Extractive Multi-document Summarization System for Portuguese

This is a proposal for a demonstration of simba at PROPOR 2012. simba is an extractive multi-document summarization system that aims at producing generic summaries guided by a compression rate defined by the user. It uses a double-clustering approach to find the relevant information in a set of texts. In addition, simba uses a sentence simplification procedure as a means to ensure summary compress...


Examining the Generic Features of Thesis Acknowledgments: A Case of Iranian MA Graduate Students Majoring in Teaching to Speakers of Other Languages (AZFA) and TEFL

Thesis acknowledgement is a written genre in which MA graduate students offer their gratitude to individuals, who have contributed to the completion of their study. The aim of the current study was to examine the thesis acknowledgements written by Iranian MA students in the field of Persian Language Teaching to Non-Persian Speakers (Amouzeshe Zaban e Farsi be Kharejian, AZFA) and TEFL in terms ...


Publication date: 2004